7 research outputs found

    Data Collection and Machine Learning Methods for Automated Pedestrian Facility Detection and Mensuration

    Get PDF
    Large-scale collection of pedestrian facility (crosswalks, sidewalks, etc.) presence data is vital to the success of efforts to improve pedestrian facility management, safety analysis, and road network planning. However, this kind of data is typically not available on a large scale due to the high labor and time costs that are the result of relying on manual data collection methods. Therefore, methods for automating this process using techniques such as machine learning are currently being explored by researchers. In our work, we mainly focus on machine learning methods for the detection of crosswalks and sidewalks from both aerial and street-view imagery. We test data from these two viewpoints individually and with an ensemble method that we refer to as our “dual-perspective prediction model”. In order to obtain this data, we developed a data collection pipeline that combines crowdsourced pedestrian facility location data with aerial and street-view imagery from Bing Maps. In addition to the Convolutional Neural Network used to perform pedestrian facility detection using this data, we also trained a segmentation network to measure the length and width of crosswalks from aerial images. In our tests with a dual-perspective image dataset that was heavily occluded in the aerial view but relatively clear in the street view, our dual-perspective prediction model was able to increase prediction accuracy, recall, and precision by 49%, 383%, and 15%, respectively (compared to using a single perspective model based on only aerial view images). In our tests with satellite imagery provided by the Mississippi Department of Transportation, we were able to achieve accuracies as high as 99.23%, 91.26%, and 93.7% for aerial crosswalk detection, aerial sidewalk detection, and aerial crosswalk mensuration, respectively. The final system that we developed packages all of our machine learning models into an easy-to-use system that enables users to process large batches of imagery or examine individual images in a directory using a graphical interface. Our data collection and filtering guidelines can also be used to guide future research in this area by establishing standards for data quality and labelling

    Protein Residue-Residue Contact Prediction Using Stacked Denoising Autoencoders

    Get PDF
    Protein residue-residue contact prediction is one of many areas of bioinformatics research that aims to assist researchers in the discovery of structural features of proteins. Predicting the existence of such structural features can provide a starting point for studying the tertiary structures of proteins. This has the potential to be useful in applications such as drug design where tertiary structure predictions may play an important role in approximating the interactions between drugs and their targets without expending the monetary resources necessary for preliminary experimentation. Here, four different methods involving deep learning, support vector machines (SVMs), and direct coupling analysis were trained on a dataset of proteins from the 9th Critical Assessment of Techniques for Protein Structure Prediction (CASP 9). The models that were the most successful after training on the CASP 9 data were selected to perform the contact predictions in each method. After performing a blind test on CASP 11 targets, we have determined that further optimizations to the training process may be necessary to improve performance

    Predicting Protein Residue-Residue Contacts Using Random Forests and Deep Networks

    Get PDF
    Background: The ability to predict which pairs of amino acid residues in a protein are in contact with each other offers many advantages for various areas of research that focus on proteins. For example, contact prediction can be used to reduce the computational complexity of predicting the structure of proteins and even to help identify functionally important regions of proteins. These predictions are becoming especially important given the relatively low number of experimentally determined protein structures compared to the amount of available protein sequence data. Results: Here we have developed and benchmarked a set of machine learning methods for performing residue-residue contact prediction, including random forests, direct-coupling analysis, support vector machines, and deep networks (stacked denoising autoencoders). These methods are able to predict contacting residue pairs given only the amino acid sequence of a protein. According to our own evaluations performed at a resolution of +/− two residues, the predictors we trained with the random forest algorithm were our top performing methods with average top 10 prediction accuracy scores of 85.13% (short range), 74.49% (medium range), and 54.49% (long range). Our ensemble models (stacked denoising autoencoders combined with support vector machines) were our best performing deep network predictors and achieved top 10 prediction accuracy scores of 75.51% (short range), 60.26% (medium range), and 43.85% (long range) using the same evaluation. These tests were blindly performed on targets from the CASP11 dataset; and the results suggested that our models achieved comparable performance to contact predictors developed by groups that participated in CASP11. Conclusions: Due to the challenging nature of contact prediction, it is beneficial to develop and benchmark a variety of different prediction methods. Our work has produced useful tools with a simple interface that can provide contact predictions to users without requiring a lengthy installation process. In addition to this, we have released our C++ implementation of the direct-coupling analysis method as a standalone software package. Both this tool and our RFcon web server are freely available to the public at http://dna.cs.miami.edu/RFcon/

    Deep Learning-Based Structure-Activity Relationship Modeling for Multi-Category Toxicity Classification: A Case Study of 10K Tox21 Chemicals With High-Throughput Cell-Based Androgen Receptor Bioassay Data

    Get PDF
    Deep learning (DL) has attracted the attention of computational toxicologists as it offers a potentially greater power for in silico predictive toxicology than existing shallow learning algorithms. However, contradicting reports have been documented. To further explore the advantages of DL over shallow learning, we conducted this case study using two cell-based androgen receptor (AR) activity datasets with 10K chemicals generated from the Tox21 program. A nested double-loop cross-validation approach was adopted along with a stratified sampling strategy for partitioning chemicals of multiple AR activity classes (i.e., agonist, antagonist, inactive, and inconclusive) at the same distribution rates amongst the training, validation and test subsets. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p \u3c 0.001, ANOVA) by 22–27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Further in-depth analyses of chemical scaffolding shed insights on structural alerts for AR agonists/antagonists and inactive/inconclusive compounds, which may aid in future drug discovery and improvement of toxicity prediction modeling

    An Autoencoder-Based Deep Learning Method For Genotype Imputation

    Get PDF
    Genotype imputation has a wide range of applications in genome-wide association study (GWAS), including increasing the statistical power of association tests, discovering trait-associated loci in meta-analyses, and prioritizing causal variants with fine-mapping. In recent years, deep learning (DL) based methods, such as sparse convolutional denoising autoencoder (SCDA), have been developed for genotype imputation. However, it remains a challenging task to optimize the learning process in DL-based methods to achieve high imputation accuracy. To address this challenge, we have developed a convolutional autoencoder (AE) model for genotype imputation and implemented a customized training loop by modifying the training process with a single batch loss rather than the average loss over batches. This modified AE imputation model was evaluated using a yeast dataset, the human leukocyte antigen (HLA) data from the 1,000 Genomes Project (1KGP), and our in-house genotype data from the Louisiana Osteoporosis Study (LOS). Our modified AE imputation model has achieved comparable or better performance than the existing SCDA model in terms of evaluation metrics such as the concordance rate (CR), the Hellinger score, the scaled Euclidean norm (SEN) score, and the imputation quality score (IQS) in all three datasets. Taking the imputation results from the HLA data as an example, the AE model achieved an average CR of 0.9468 and 0.9459, Hellinger score of 0.9765 and 0.9518, SEN score of 0.9977 and 0.9953, and IQS of 0.9515 and 0.9044 at missing ratios of 10% and 20%, respectively. As for the results of LOS data, it achieved an average CR of 0.9005, Hellinger score of 0.9384, SEN score of 0.9940, and IQS of 0.8681 at the missing ratio of 20%. In summary, our proposed method for genotype imputation has a great potential to increase the statistical power of GWAS and improve downstream post-GWAS analyses

    A Review On Machine Learning Methods For \u3ci\u3eIn Silico\u3c/i\u3e Toxicity Prediction

    No full text
    In silico toxicity prediction plays an important role in the regulatory decision making and selection of leads in drug design as in vitro/vivo methods are often limited by ethics, time, budget, and other resources. Many computational methods have been employed in predicting the toxicity profile of chemicals. This review provides a detailed end-to-end overview of the application of machine learning algorithms to Structure-Activity Relationship (SAR)-based predictive toxicology. From raw data to model validation, the importance of data quality is stressed as it greatly affects the predictive power of derived models. Commonly overlooked challenges such as data imbalance, activity cliff, model evaluation, and definition of applicability domain are highlighted, and plausible solutions for alleviating these challenges are discussed
    corecore